Skip to content

ARROW-7979: [C++] Add experimental buffer compression to IPC write path. Add "field" selection to read path. Migrate some APIs to Result<T>. Read/write Message metadata#6638

Closed
wesm wants to merge 26 commits into
apache:masterfrom
wesm:ARROW-7979

Conversation

@wesm

@wesm wesm commented Mar 17, 2020

Copy link
Copy Markdown
Member

I apologize for the complexity of this patch, there was some necessary refactoring and then as I needed to add a IpcReadOptions struct, it made sense to port some APIs from Status to Result.

Summary of what's here

  • Add IpcReadOptions struct
  • Rename IpcOptions to IpcWriteOptions
  • Serialize and write custom_metadata field in Message Flatbuffer struct
  • Move MemoryPool* arguments in IPC functions to options structs
  • Adds EXPERIMENTAL compression option to IpcOptions (which should be renamed IpcWriteOptions) which causes each buffer in a record batch body to be separately compressed with e.g. ZSTD. This is intended to be used internally in Feather V2 and not exported for general use. Based on the mailing list discussion, if we were to adopt this in the IPC protocol then this can be promoted to non-experimental
  • Add "included_fields" option to select a subset of fields to load from a record batch when doing an IPC load. The motivation for this was to avoid unnecessary decompression when reading a subset of columns from an IPC stream
  • Migrate most IPC read APIs to use Result, deprecate old methods
  • Deprecate Status-returning IPC write method

Some other small changes:

  • Add check_metadata option to RecordBatch::Equals
  • Add Codec::GetCompressionType method for looking up Compression::type given a codec name

I don't have size/perf benchmarks about how the compression helps with file sizes or read performance, so I'll do that next in the course of completing ARROW-5510 ("Feather V2" file format)

@github-actions

Copy link
Copy Markdown

@wesm

wesm commented Mar 17, 2020

Copy link
Copy Markdown
Member Author

I'll fix the glib issues and other things tomorrow

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't read a lot of this, so just a few comments.

Comment thread cpp/src/arrow/ipc/metadata_internal.h Outdated
Comment thread cpp/src/arrow/ipc/options.h Outdated
Comment thread cpp/src/arrow/ipc/options.h Outdated
Comment thread cpp/src/arrow/ipc/reader.cc Outdated
Comment thread cpp/src/arrow/ipc/reader.cc Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are at most 3 or 4 buffers in a Arrow array, and they have very different sizes. I don't think parallelizing at this level makes sense.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. The parallelization needs to happen at the field-level, which will require some refactoring. I'll open a follow up JIRA since it's out of scope for this PR

Comment thread cpp/src/arrow/util/compression.cc Outdated
Comment thread cpp/src/arrow/util/compression_test.cc Outdated

@bkietz bkietz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Lots of comments, mostly code style/nits

Comment thread cpp/src/arrow/dataset/file_ipc.cc Outdated
Comment on lines 44 to 46

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: avoid using Result::Value unless wrapping a result returning API into a status returning API

Suggested change
auto status = ipc::RecordBatchFileReader::Open(std::move(input)).Value(&reader);
if (!status.ok()) {
return status.WithMessage("Could not open IPC input source '", source.path(),
"': ", status.message());
}
auto maybe_reader = ipc::RecordBatchFileReader::Open(std::move(input));
auto status = maybe_reader.status();
if (!status.ok()) {
return status.WithMessage("Could not open IPC input source '", source.path(),
"': ", status.message());
}
return std::move(maybe_reader).ValueOrDie();

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is using Result::Value bad? Not obvious from reading the docstrings

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not bad, just less preferred due to mixing argout Status return and Result usage

Comment thread cpp/src/arrow/ipc/file_to_stream.cc Outdated
Comment thread cpp/src/arrow/ipc/file_to_stream.cc Outdated
Comment thread cpp/src/arrow/ipc/message.cc Outdated
Comment thread cpp/src/arrow/ipc/metadata_internal.cc Outdated
Comment thread cpp/src/arrow/ipc/reader.cc Outdated
Comment thread cpp/src/arrow/ipc/reader.cc Outdated
Comment thread cpp/src/arrow/ipc/writer.cc Outdated
Comment thread cpp/src/arrow/ipc/writer.cc Outdated
Comment thread cpp/src/arrow/ipc/writer.cc Outdated
@wesm wesm force-pushed the ARROW-7979 branch 2 times, most recently from 11fa5ad to 2d17d32 Compare March 19, 2020 21:53
@wesm

wesm commented Mar 19, 2020

Copy link
Copy Markdown
Member Author

I think I've addressed almost all of the comments. I'll try to get all the builds passing (C++/Python/R/GLib all passing for me locally)

@wesm

wesm commented Mar 20, 2020

Copy link
Copy Markdown
Member Author

Hmm looks like I broke something related to Maps

################# FAILURES #################
FAILED TEST: map C++ producing,  C++ consuming

FAILED TEST: map C++ producing,  Java consuming

FAILED TEST: map C++ producing,  JS consuming

FAILED TEST: map Java producing,  C++ consuming

FAILED TEST: map JS producing,  C++ consuming

FAILED TEST: map C++ producing,  C++ consuming

FAILED TEST: map C++ producing,  Java consuming

FAILED TEST: map Java producing,  C++ consuming

I'll fix that plus the Windows issue tomorrow. If I could get the thumbs up/down on the big picture stuff on the meantime that would be great

@bkietz bkietz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

big picture: +1

@wesm

wesm commented Mar 20, 2020

Copy link
Copy Markdown
Member Author

@lidavidm some of the Flight integration tests failed with this error, do you know what it could be about?

  File "/arrow/dev/archery/archery/integration/tester_cpp.py", line 91, in flight_server
    raise RuntimeError(
RuntimeError: Flight-C++ server did not start properly, stdout:
/opt/conda/envs/arrow/lib/libarrow.so.100(+0x12faf48)[0x7f9280b1ef48]


stderr:
E0320 17:56:33.746761714   10694 socket_utils_common_posix.cc:222] check for SO_REUSEPORT: {"created":"@1584726993.746751914","description":"SO_REUSEPORT unavailable on compiling system","file":"../src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":190}
E0320 17:56:33.746901914   10694 server_chttp2.cc:40]        {"created":"@1584726993.746866914","description":"No address added out of total 1 resolved","file":"../src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":395,"referenced_errors":[{"created":"@1584726993.746865414","description":"Failed to add any wildcard listeners","file":"../src/core/lib/iomgr/tcp_server_posix.cc","file_line":342,"referenced_errors":[{"created":"@1584726993.746848214","description":"Address family not supported by protocol","errno":97,"file":"../src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":420,"os_error":"Address family not supported by protocol","syscall":"socket","target_address":"[::]:33625"},{"created":"@1584726993.746865114","description":"Unable to configure socket","fd":5,"file":"../src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1584726993.746862214","description":"Address already in use","errno":98,"file":"../src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":189,"os_error":"Address already in use","syscall":"bind"}]}]}]}
/arrow/cpp/src/arrow/flight/test_integration_server.cc:165:  Check failed: _s.ok() Operation failed: g_server->Init(options)
Bad status: Unknown error: Server did not start properly

@lidavidm

Copy link
Copy Markdown
Member

The relevant bit is Address already in use so it seems the Flight tests are not safe to be run in parallel, even with the port allocation code in the test runner. We should eventually have the server bind on port 0 and have the integration test runner parse the port out of its output to avoid races like this.

@lidavidm

lidavidm commented Mar 20, 2020

Copy link
Copy Markdown
Member

There's https://issues.apache.org/jira/browse/ARROW-6528 which is about unit tests, I'll add a related JIRA about fixing this in integration too.

https://issues.apache.org/jira/browse/ARROW-8176

@wesm

wesm commented Mar 20, 2020

Copy link
Copy Markdown
Member Author

@lidavidm thanks, I'll set the integration test task to run serially for now

@wesm

wesm commented Mar 20, 2020

Copy link
Copy Markdown
Member Author

The UBSAN failure is a weird one. It's fixed by the Flatbuffers 1.12 upgrade ARROW-8178, so I'll get that merged once the builds run and then rebase this

@kou kou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to review this in a few days.

Comment thread cpp/src/arrow/ipc/writer.h Outdated
@wesm

wesm commented Mar 22, 2020

Copy link
Copy Markdown
Member Author

@nealrichardson any advice about the R test failure? The output log didn't give any clues and all the tests pass for me locally on Linux

@nealrichardson

Copy link
Copy Markdown
Member

Hmm. It looks like it just dies when loading the package to run the 32-bit version of the tests, but the 64-bit version runs fine.

@wesm

wesm commented Mar 23, 2020

Copy link
Copy Markdown
Member Author

Cross-referencing these results. The compression looks highly beneficial in terms of reduction in message size for modest decompression time

#6694 (comment)

I would guess that this would yield significant throughput benefits for Flight users. cc @lidavidm

I'll update the mailing list thread about this

@kou kou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

We can merge this once the R problem is fixed.

Comment thread cpp/src/arrow/flight/client.cc Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add const IpcWriteOptions& options to GrpcStreamWriter::Open() instead?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should do that. It will have to happen after @lidavidm's refactoring goes in. Opening a JIRA https://issues.apache.org/jira/browse/ARROW-8190

Comment thread cpp/src/arrow/ipc/writer.h Outdated
wesm added 7 commits March 23, 2020 20:04
add empty reader_internal.h

Some more temporary cutting away / simplification

Ongoing refactoring

More refactoring and cleaning of Feather V1

More cleaning, simplification

Cleaning

fixes

Get feather.cc compiling again

Slightly more working

Feather tests passing again

Some code reorganization, simply IPC writing a bit

Start writer_internal.h

Start writer_internal.h

Incomplete refactoring

Externalize enough of IPC writer machinery to be able to customize for Feather

Remove overly elaborate externalization of IPC writer internals, implement compression as part of IpcOptions

Work on serializing custom_metadata in Message

Fix compilation

Test empty metadata case also

iwyu

Some more work on compression

complete more of decompression

Decompression draft

Implement various deprecations, and get compression tests passing

Revert Feather changes
@wesm

wesm commented Mar 24, 2020

Copy link
Copy Markdown
Member Author

This still fails.

EDIT: What I wrote about STOP_OR_VALUE was not right.

I'm not sure why this is failing. I'm going to rebase but we probably need to get some debugging help or instructions how to run this build locally on Windows

@wesm

wesm commented Mar 24, 2020

Copy link
Copy Markdown
Member Author

I built the project in 32-bit mode with -m32 on Linux and the tests pass fine. Not sure how to proceed with debugging

@nealrichardson

Copy link
Copy Markdown
Member

@wesm assuming you have a windows box with mingw stuff,

  1. Go to https://cran.r-project.org/bin/windows/ and install R
  2. Follow the Rtools link there and install Rtools 3.5
  3. Build the C++ library with makepkg-mingw following https://github.com/apache/arrow/blob/master/ci/scripts/r_windows_build.sh
  4. Set the file path of the resulting .zip file of the libs as the env var RWINLIB_LOCAL and then install the R package

See https://github.com/apache/arrow/blob/master/.github/workflows/r.yml#L157-L177

@wesm

wesm commented Mar 24, 2020

Copy link
Copy Markdown
Member Author

@nealrichardson OK I'll give it a try tomorrow. We are definitely weak when it comes to Windows issues, very few developers are equipped to debug these kinds of issues locally

@wesm

wesm commented Mar 24, 2020

Copy link
Copy Markdown
Member Author

@nealrichardson I don't have the "mingw stuff" (Python uses MSVC to build things), what other machine setup do I have to do outside of installing RTools?

@nealrichardson

Copy link
Copy Markdown
Member

I believe you need to install https://www.msys2.org/. The R Windows GHA uses https://github.com/numworks/setup-msys2 for that; Kou added some scripts ci/scripts/msys* that the Ruby jobs use.

@wesm

wesm commented Mar 24, 2020

Copy link
Copy Markdown
Member Author

I'm working on this, but I have almost no idea what I'm doing. For example, there are so many different shells (normal Windows shell, Powershell, msys2 shell), how do I get the commands listed at

https://github.com/apache/arrow/blob/master/.github/workflows/cpp.yml#L278

to work? Flying blind

@wesm

wesm commented Mar 24, 2020

Copy link
Copy Markdown
Member Author

It seems I'm supposed to be using Powershell. It would be nice at some point to have easy-to-follow source build instructions for R on Windows so the next person who runs into a problem doesn't have to dig through all these CI scripts to figure out what is what

@wesm

wesm commented Mar 24, 2020

Copy link
Copy Markdown
Member Author

I'm giving up, can't figure out how to configure my environment to be able to run https://github.com/apache/arrow/blob/master/ci/scripts/r_windows_build.sh after about an hour of trying. I will need help from someone else to find out why the 32-bit R build is broken

@nealrichardson

Copy link
Copy Markdown
Member

@wesm I've reproduced the 32-bit windows R segfault and pushed a skip for the test so that this is unblocked. The code that triggers the segfault is in the IPC format Datasets code, presumably in the scan operation because it succeeds in instantiating the dataset. @bkietz maybe you can take a look too and see if you see something suspect there?

@nealrichardson

nealrichardson commented Mar 24, 2020

Copy link
Copy Markdown
Member

Before I forget (high mental load average right now), here's how I reproduced it:

  1. Forget msys2 and all that. I used ursa-labs/arrow-r-nightly to build the C++ library for Windows on this branch and publish the artifact to bintray with an arbitrary version number (0.16.0.666666): https://github.com/ursa-labs/arrow-r-nightly/runs/531644004
  2. In my local Windows 10 VM with R and Rtools installed, (a) set the arbitrary version number in r/DESCRIPTION to match the thing I put on bintray, and (b) remotes::install_local("Z:\\arrow\\r") to install dependencies and then the package itself, compiling the bindings from source.
  3. Since this only crashed on 32-bit R, run that from the Command Prompt. C:\Program Files\R\R-3.6.0\bin\i386\R.exe
  4. Since running R with gdb on Windows sounds terrible, I instead ran the tests with more verbosity, using the "location" test reporter, so that I could see where exactly it died:
setwd("Z:\\arrow\\r\\tests/")
library("testthat")
test_check("arrow", reporter="location")

That gave a bunch of output, until it crashed and I was dumped back to the command prompt:

...
Start test: Simple interface for datasets (custom ParquetFileFormat)
  test-dataset.R#124:1 [success]
End test: Simple interface for datasets (custom ParquetFileFormat)

Start test: Hive partitioning
  test-dataset.R#129:1 [success]
  test-dataset.R#130:1 [success]
End test: Hive partitioning

Start test: Partitioning inference
  test-dataset.R#144:1 [success]
  test-dataset.R#145:1 [success]
  test-dataset.R#158:1 [success]
  test-dataset.R#159:1 [success]
End test: Partitioning inference

Start test: IPC/Arrow format data
  test-dataset.R#172:1 [success]
  test-dataset.R#173:1 [success]

C:\Program Files>

So the error was something happening after L173 in test-dataset.R.

@kou

kou commented Mar 25, 2020

Copy link
Copy Markdown
Member

"Dev / Source Release and Merge Script" job failure will be fixed by #6711.

@wesm

wesm commented Mar 25, 2020

Copy link
Copy Markdown
Member Author

Thank you @nealrichardson, that's definitely helpful information.

I looked at file_ipc.cc and didn't see anything especially concerning. I don't know if it's related but there are various usages of Result::ValueOrDie in the datasets implementation, we could turn on logging to see of one of those "shouldn't fail" conditions is occurring somehow

@wesm

wesm commented Mar 25, 2020

Copy link
Copy Markdown
Member Author

One change that this PR makes is making RecordBatchFileReader an abstract class with pure virtual methods. I don't see why that would cause a problem but...

@nealrichardson

Copy link
Copy Markdown
Member

Beats me. I'll leave it to you to decide whether to merge this and create a followup for that or whether you want to hold this to debug further. If it were me, I'd do the former: IPC-format Datasets aren't a critical feature. And I suspect you don't feel like debugging 32-bit windows any further either ;)

@wesm

wesm commented Mar 25, 2020

Copy link
Copy Markdown
Member Author

OK, I'll merge this. Is there someone who can help us get a backtrace with gdb so we can try to find out what is happening?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants